feat(db-builder): integrate Weaver for semconv compliance checking (#97) #382
SurbhiAgarwal1 wants to merge 3 commits into
✅ Deploy Preview for otel-ecosystem-explorer ready!
Pull request overview
Integrates semantic-convention (semconv) compliance checking into the Explorer DB build pipeline using the OpenTelemetry Weaver CLI, and adds support for publishing and serving per-library README markdown content (content-addressed via a hash) to the frontend.
Changes:
- Add `SemconvEnricher` to generate a Weaver registry from instrumentation telemetry and annotate metrics/spans with `semconv_compliance`.
- Publish library README markdown files to the generated database and backfill/augment instrumentations with `markdown_hash`.
- Extend frontend types and API helpers to support `semconv_compliance` and README loading.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| SEMCONV_INTEGRATION_DETAIL.md | Adds a technical deep-dive doc describing the semconv enrichment pipeline and schema updates. |
| ecosystem-explorer/src/types/javaagent.ts | Extends TS types to include `markdown_hash` and `semconv_compliance` on signals. |
| ecosystem-explorer/src/lib/api/javaagent-data.ts | Adds an API helper to fetch published README markdown files. |
| ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/inventory_manager.py | Adds helpers to scan/read content-addressed README markdown files per version. |
| ecosystem-automation/java-instrumentation-watcher/src/java_instrumentation_watcher/__init__.py | Adds a dev fallback for `__version__` when package metadata isn’t present. |
| ecosystem-automation/explorer-db-builder/tests/test_semconv_enricher.py | Adds unit tests for version extraction, YAML generation, and mocked Weaver interactions. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/semconv_enricher.py | Introduces the Weaver-based semconv compliance enricher. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/metadata_backfiller.py | Allows `markdown_hash` to be backfilled across versions. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/main.py | Wires semconv enrichment into `process_version` and publishes/augments README markdown during the build. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/database_writer.py | Adds `write_markdown` to publish README markdown into the DB output. |
| ecosystem-automation/explorer-db-builder/src/explorer_db_builder/__init__.py | Adds a dev fallback for `__version__` when package metadata isn’t present. |
| ecosystem-automation/collector-watcher/src/collector_watcher/__init__.py | Adds a dev fallback for `__version__` when package metadata isn’t present. |
Hi @SurbhiAgarwal1, I just noticed this branch includes the commit from #380, which is still under review. You can see the effect on Copilot's review: several of its comments here are actually about the README code from #380, not the semconv work. I'd rather wait for #380 to be merged and this branch rebased on main before doing a full review, so the diff reflects only the Weaver integration. Thanks for the work!
Force-pushed: 93b9e24 to f399623
Hi @lucacavenaghi97, I have rebased this branch and fixed the test failures and formatting issues, so all CI checks are now passing. It is ready for your review. Thanks!
Hi @SurbhiAgarwal1, could you run one more rebase, please? There are new conflicts with this branch. Thanks for your patience!
…pen-telemetry#97)
- Implement SemconvEnricher to validate telemetry via OTel Weaver
- Insert enrichment stage into the javaagent builder pipeline
- Add semconv_compliance field to Metric and Span models
- Support dynamic versioning based on instrumentation schema_url
- Added timeout and improved error handling for Weaver subprocess calls.
- Implemented conservative semconv compliance mapping on failure.
- Added path traversal sanitization for library README names in automation.
- Updated frontend loadLibraryReadme to use fetchWithCache.
- Added comprehensive unit tests for markdown publishing and build orchestration.
- Fixed test assertion mismatches caused by updated error message formats.
Force-pushed: 293f092 to 59678eb
- Ran prettier and ruff to fix formatting issues in both frontend and backend.
- Fixed markdown lint issues by wrapping long lines and ignoring generated data files.
- Updated java-instrumentation-watcher tests to match new sanitization logic.
Force-pushed: 59678eb to 3f709b9
Technical Detail: Semantic Convention Integration (Issue #97)

This document provides a technical deep-dive into the implementation of the Semantic Convention compliance pipeline in the `explorer-db-builder`.

1. Architectural Overview
The integration follows a "sidecar" enrichment pattern. Instead of modifying the core data structures, we introduce a `SemconvEnricher` that evaluates telemetry metadata against standard OTel registries using the OpenTelemetry Weaver engine.

Data Flow
- … `InstrumentationData`.
- … `weaver registry check` against a specific semconv version.

2. Component: `SemconvEnricher`

Location: `explorer_db_builder/semconv_enricher.py`

This is the primary orchestrator for compliance checking.
Transformation Logic

The enricher generates a temporary directory containing:
- `manifest.yaml`: Defines the instrumentation name and the dependency on the official OTel semantic convention registry (e.g., `github.com/open-telemetry/semantic-conventions@v1.37.0`).
- `telemetry.yaml`: Translates internal metadata into Weaver's definition format.
  - … `type: metric` and attributes using the `ref` keyword to ensure Weaver validates them against the registry's definitions.
  - … `type: span`, using synthetic IDs based on the instrumentation name and span kind (e.g., `activej-http.SERVER`).

Weaver Invocation
The enricher calls the `weaver` CLI via a subprocess.
- … `weaver registry check` exits with code 0, all signals defined in the registry are considered compliant.
- … `stderr` output to identify specific signals that failed validation and marks them accordingly.

3. Pipeline Integration
Location: `explorer_db_builder/main.py`

The enrichment stage is integrated into `process_version` immediately after the `transform_instrumentation_format` call.

This placement ensures that:
4. Frontend & Metadata Schema
Location: `ecosystem-explorer/src/types/javaagent.ts`

The compliance status is persisted as a `semconv_compliance` array on individual telemetry signals:

`{ "name": "http.server.request.duration", "unit": "s", "semconv_compliance": ["1.37.0"] }`

This structure is extensible, allowing an instrumentation to be marked as compliant with multiple semantic convention versions over time.
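
As a rough illustration of that extensibility, here is a minimal Python sketch that records a version in a signal's `semconv_compliance` list without duplicating entries. The helper name `mark_compliant` and the second version `1.38.0` are hypothetical and do not come from the PR; the actual enricher may record versions differently.

```python
from typing import Any


def mark_compliant(signal: dict[str, Any], semconv_version: str) -> dict[str, Any]:
    """Return a copy of `signal` with `semconv_version` recorded at most once.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    versions = list(signal.get("semconv_compliance", []))
    if semconv_version not in versions:
        versions.append(semconv_version)
    # Build a new dict so the caller's original signal is left untouched.
    return {**signal, "semconv_compliance": versions}


metric = {"name": "http.server.request.duration", "unit": "s"}
metric = mark_compliant(metric, "1.37.0")
metric = mark_compliant(metric, "1.37.0")  # repeated call adds nothing
metric = mark_compliant(metric, "1.38.0")  # hypothetical later version
print(metric["semconv_compliance"])
```

Returning a copy rather than mutating in place keeps the sketch easy to test; whether the real pipeline mutates signals in place is not stated in the document.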
5. Verification & Testing
Location: `tests/test_semconv_enricher.py`

A dedicated test suite validates the following:
- … `manifest.yaml` and `telemetry.yaml` are valid and follow Weaver's specification.